 comparative evaluation


Supplementary Material: Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses

Neural Information Processing Systems

M-SYNTH is organized into a directory structure that indicates the parameters. Code and dataset are released under the Creative Commons 1.0 Universal License. We now review the timing required to perform mass insertion and imaging. In Table 2, we review the imaging time required for each breast density. The time varies from 2.84 [...] GPU), we were able to generate the complete dataset in about two weeks.

Table 2: Timing analysis for imaging by breast density.

Breast Density    Time (min)
Fatty             13.463809
Scattered         11.002291
Hetero            3.655613
Dense             2.842028

Additional renderings of the breast phantoms generated for the study are shown in Figure 1, demonstrating a high level of detail and anatomical variability within and among models.
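The per-density timings reported in Table 2 can be compared with a few lines of Python. This is only an illustrative sketch: the four values are copied from the table, while the variable names are made up here, and the excerpt does not state what unit of work each timing covers, so the unweighted mean is a rough summary rather than a figure from the paper.

```python
# Imaging times in minutes, copied from Table 2 of the excerpt.
imaging_minutes = {
    "Fatty": 13.463809,
    "Scattered": 11.002291,
    "Hetero": 3.655613,
    "Dense": 2.842028,
}

# Identify the slowest and fastest breast density to image.
slowest = max(imaging_minutes, key=imaging_minutes.get)
fastest = min(imaging_minutes, key=imaging_minutes.get)

# Ratio of the longest to the shortest imaging time.
ratio = imaging_minutes[slowest] / imaging_minutes[fastest]

# Unweighted mean over the four densities (assumes equal weighting,
# which the excerpt does not confirm).
mean_minutes = sum(imaging_minutes.values()) / len(imaging_minutes)

print(f"{slowest} takes {ratio:.2f}x as long as {fastest}; "
      f"unweighted mean is {mean_minutes:.2f} min")
```

The comparison shows that imaging time falls as breast density increases in the reported figures, with fatty breasts taking several times longer than dense ones.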


Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses

Neural Information Processing Systems

To generate evidence regarding the safety and efficacy of artificial intelligence (AI)-enabled medical devices, AI models need to be evaluated on a diverse population of patient cases, some of which may not be readily available. We propose an evaluation approach for testing medical imaging AI models that relies on in silico imaging pipelines in which stochastic digital models of human anatomy (in object space) with and without pathology are imaged using a digital replica imaging acquisition system to generate realistic synthetic image datasets. Here, we release M-SYNTH, a dataset of cohorts with four breast fibroglandular density distributions imaged at different exposure levels using Monte Carlo x-ray simulations with the publicly available Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) toolkit. We utilize the synthetic dataset to analyze AI model performance and find that model performance decreases with increasing breast density and increases with higher mass density, as expected. As exposure levels decrease, AI model performance drops, with the highest performance achieved at exposure levels lower than the nominal recommended dose for the breast type.


Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses

Neural Information Processing Systems

Precise mass location and extent (e.g., mass boundaries) are typically not available in the patient's records, and it is burdensome, error-prone, and sometimes impossible to


Capabilities of GPT-5 across critical domains: Is it the next breakthrough?

Georgiou, Georgios P.

arXiv.org Artificial Intelligence

The accelerated evolution of large language models has raised questions about their comparative performance across domains of practical importance. GPT-4 by OpenAI introduced advances in reasoning, multimodality, and task generalization, establishing itself as a valuable tool in education, clinical diagnosis, and academic writing, though it was accompanied by several flaws. Released in August 2025, GPT-5 incorporates a system-of-models architecture designed for task-specific optimization and, based on both anecdotal accounts and emerging evidence from the literature, demonstrates stronger performance than its predecessor in medical contexts. This study provides one of the first systematic comparisons of GPT-4 and GPT-5 using human raters from linguistics and clinical fields. Twenty experts evaluated model-generated outputs across five domains: lesson planning, assignment evaluation, clinical diagnosis, research generation, and ethical reasoning, based on predefined criteria. Mixed-effects models revealed that GPT-5 significantly outperformed GPT-4 in lesson planning, clinical diagnosis, research generation, and ethical reasoning, while both models performed comparably in assignment assessment. The findings highlight the potential of GPT-5 to serve as a context-sensitive and domain-specialized tool, offering tangible benefits for education, clinical practice, and academic research, while also advancing ethical reasoning. These results contribute to one of the earliest empirical evaluations of the evolving capabilities and practical promise of GPT-5.


Intelligent Routing for Sparse Demand Forecasting: A Comparative Evaluation of Selection Strategies

Zhang, Qiwen

arXiv.org Artificial Intelligence

Sparse and intermittent demand forecasting in supply chains presents a critical challenge, as frequent zero-demand periods hinder traditional model accuracy and impact inventory management. We propose and evaluate a Model-Router framework that dynamically selects the most suitable forecasting model, spanning classical, ML, and DL methods, for each product based on its unique demand pattern. By comparing rule-based, LightGBM, and InceptionTime routers, our approach learns to assign appropriate forecasting strategies, effectively differentiating between smooth, lumpy, or intermittent demand regimes to optimize predictions. Experiments on the large-scale Favorita dataset show our deep learning (InceptionTime) router improves forecasting accuracy by up to 11.8% (NWRMSLE) over strong, single-model benchmarks with 4.67x faster inference time. Ultimately, these gains in forecasting precision will drive substantial reductions in both stockouts and wasteful excess inventory, underscoring the critical role of intelligent, adaptive AI in optimizing contemporary supply chain operations.
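To make the idea of a rule-based router concrete, the sketch below labels a demand history by its average inter-demand interval (ADI) and the squared coefficient of variation (CV²) of its non-zero demand sizes, then looks up a forecaster for that label. The abstract does not state which rules the authors use; the Syntetos-Boylan cut-offs (1.32 and 0.49) are a widely used convention assumed here, and the function names and the model names in the routing table are hypothetical placeholders.

```python
from statistics import mean, pvariance

# Syntetos-Boylan cut-offs: an assumed convention, not taken from the abstract.
ADI_CUTOFF = 1.32
CV2_CUTOFF = 0.49


def classify_demand(series):
    """Label a demand history as smooth, erratic, intermittent, or lumpy."""
    nonzero = [x for x in series if x > 0]
    if not nonzero:
        return "no_demand"
    # Average inter-demand interval: periods per period with demand.
    adi = len(series) / len(nonzero)
    # Squared coefficient of variation of the non-zero demand sizes.
    mu = mean(nonzero)
    cv2 = pvariance(nonzero) / (mu * mu)
    if adi < ADI_CUTOFF:
        return "smooth" if cv2 < CV2_CUTOFF else "erratic"
    return "intermittent" if cv2 < CV2_CUTOFF else "lumpy"


# Hypothetical routing table: the model names are placeholders for illustration.
ROUTES = {
    "smooth": "exponential_smoothing",
    "erratic": "gradient_boosting",
    "intermittent": "croston",
    "lumpy": "deep_learning",
    "no_demand": "zero_forecast",
}


def route(series):
    """Return the name of the forecaster assigned to this demand history."""
    return ROUTES[classify_demand(series)]


if __name__ == "__main__":
    print(route([5, 6, 5, 4, 6, 5]))
    print(route([0, 0, 3, 0, 0, 3, 0, 0, 3]))
    print(route([0, 0, 1, 0, 0, 20, 0, 0, 2]))
```

A learned router, such as the LightGBM or InceptionTime variants named in the abstract, would replace the fixed thresholds with a classifier trained on features of each series.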


Comparative Evaluation of Prompting and Fine-Tuning for Applying Large Language Models to Grid-Structured Geospatial Data

Dhruv, Akash, Xie, Yangxinyu, Branham, Jordan, Mallick, Tanwi

arXiv.org Artificial Intelligence

This paper presents a comparative study of large language models (LLMs) in interpreting grid-structured geospatial data. We evaluate the performance of a base model through structured prompting and contrast it with a fine-tuned variant trained on a dataset of user-assistant interactions. Our results highlight the strengths and limitations of zero-shot prompting and demonstrate the benefits of fine-tuning for structured geospatial and temporal reasoning.


Classification of power quality events in the transmission grid: comparative evaluation of different machine learning models

Güvengir, Umut, Küçük, Dilek, Buhan, Serkan, Mantaş, Cuma Ali, Yeniceli, Murathan

arXiv.org Artificial Intelligence

Automatic classification of electric power quality events with respect to their root causes is critical for electrical grid management. In this paper, we present comparative evaluation results of an extensive set of machine learning models for the classification of power quality events, based on their root causes. After extensive experiments using different machine learning libraries, it is observed that the best performing learning models turn out to be Cubic SVM and XGBoost. During error analysis, it is observed that the main source of performance degradation for both models is the classification of ABC faults as ABCG faults, or vice versa. Ultimately, the models achieving the best results will be integrated into the event classification module of a large-scale power quality and grid monitoring system for the Turkish electricity transmission system.


Energy Price Modelling: A Comparative Evaluation of four Generations of Forecasting Methods

Andrei, Alexandru-Victor, Velev, Georg, Toma, Filip-Mihai, Pele, Daniel Traian, Lessmann, Stefan

arXiv.org Artificial Intelligence

Energy is a critical driver of modern economic systems. Accurate energy price forecasting plays an important role in supporting decision-making at various levels, from operational purchasing decisions at individual business organizations to policy-making. A significant body of literature has looked into energy price forecasting, investigating a wide range of methods to improve accuracy and inform these critical decisions. Given the evolving landscape of forecasting techniques, the literature lacks a thorough empirical comparison that systematically contrasts these methods. This paper provides an in-depth review of the evolution of forecasting modeling frameworks, from well-established econometric models to machine learning methods, early sequence learners such as LSTMs, and more recent advancements in deep learning with transformer networks, which represent the cutting edge in forecasting. We offer a detailed review of the related literature and categorize forecasting methodologies into four model families. We also explore emerging concepts like pre-training and transfer learning, which have transformed the analysis of unstructured data and hold significant promise for time series forecasting. We address a gap in the literature by performing a comprehensive empirical analysis of these four model families. Using data from the EU energy markets, we conduct a large-scale empirical study that contrasts the forecasting accuracy of different approaches, focusing especially on alternative propositions for time series transformers.